10 research outputs found

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

Author: Kettunen Kimmo Tapio
Löfberg Laura
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/05/2017

Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and many types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system’s performance is genre and domain dependent, and the entity categories used also vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a three-part categorization into locations, persons and corporations. In this paper we report evaluation results of NER with two different data sets: the digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. The new tool we evaluate for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. With the historical newspaper data, the FST achieves F-scores of up to 55–61 with locations and 51–52 with persons, and its performance with locations is comparable to that of FiNER. With the modern Finnish technology news of Digitoday, FiNER achieves F-scores of up to 79 with locations. Person names show the worst performance; their F-score varies from 33 to 66. The FST performs as well as FiNER with Digitoday’s location names, but worse with persons. With corporations the FST is at its worst, while FiNER performs reasonably well. Overall, our results show that a general semantic tool like the FST can perform the restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging need not be a task only for dedicated NE taggers; it can be performed equally well with more general multipurpose semantic tools. Peer reviewed.
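The F-scores quoted in the abstract above are on a 0–100 scale. As a minimal sketch of how such a score is computed, assuming the balanced F1 measure (the abstract does not say which variant was used) and using made-up counts that are not taken from the paper:

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1), each on a 0-100 scale.

    Assumes the balanced F1 measure; the function name and the
    count-based interface are illustrative, not from the paper.
    """
    predicted = true_positives + false_positives   # entities the tagger marked
    actual = true_positives + false_negatives      # entities in the gold data
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical counts for one entity category (not results from the paper).
precision, recall, f1 = f_score(true_positives=60,
                                false_positives=40,
                                false_negatives=40)
print(f"P={precision:.1f} R={recall:.1f} F1={f1:.1f}")
```

With these invented counts, precision and recall are both 60, so the F1 is also 60, which is in the range the abstract reports for locations in the historical data.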

Helsingin yliopiston digitaalinen arkisto

FiST – towards a Free Semantic Tagger of Modern Standard Finnish

Author: Kettunen Kimmo Tapio
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2019

This paper introduces work in progress on implementing a free full-text semantic tagger for Finnish, FiST. The tagger is based on a 46 226 lexeme semantic lexicon of Finnish that was published in 2016. The basis of the semantic lexicon was developed in the early 2000s in the EU-funded project Benedict (Löfberg et al., 2005). Löfberg (2017) describes the compilation of the lexicon and evaluates a proprietary version of the Finnish Semantic Tagger, the FST. The FST and its lexicon were developed using the English Semantic Tagger (the EST) of the University of Lancaster as a model. This semantic tagger was developed at the University Centre for Corpus Research on Language (UCREL) at Lancaster University as part of the UCREL Semantic Analysis System (USAS) framework. The semantic lexicon of the USAS framework is based on the modified and enriched categories of the Longman Lexicon of Contemporary English (McArthur, 1981). We have implemented a basic working version of a new full-text semantic tagger for Finnish based on freely available components. The implementation uses Omorfi and FinnPos for morphological analysis of Finnish words. After the morphological recognition phase, words from the 46K semantic lexicon are matched against the morphologically unambiguous base forms. In our comprehensive tests the lexical tagging coverage of the current implementation is around 82–90% with different text types. The present version still needs some enhancements, at least the processing of semantic ambiguity of words and the analysis of compounds, and perhaps also the treatment of multiword expressions. A semantically marked ground truth evaluation collection should also be established for evaluation of the tagger. Peer reviewed.
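The lexicon lookup and coverage figure described in the abstract above can be sketched as follows. This is an illustrative sketch only: it assumes the base forms have already been produced by a morphological analyzer, it does not call Omorfi or FinnPos, and the function name, the toy lexicon and its placeholder tags are all hypothetical.

```python
def lexical_coverage(base_forms, lexicon):
    """Percentage of tokens whose base form is found in the lexicon.

    base_forms: list of lemmas, assumed to come from an upstream
                morphological analysis step (not performed here).
    lexicon:    mapping from lemma to semantic tag.
    """
    if not base_forms:
        return 0.0
    hits = sum(1 for lemma in base_forms if lemma.lower() in lexicon)
    return 100 * hits / len(base_forms)


# Toy lexicon with placeholder tags; the real lexicon has 46 226 lexemes.
toy_lexicon = {
    "talo": "TAG_A",
    "kirja": "TAG_B",
    "lukea": "TAG_C",
    "suuri": "TAG_D",
}

# Hypothetical base forms for a five-token sentence.
lemmas = ["talo", "olla", "suuri", "kirja", "lukea"]

coverage = lexical_coverage(lemmas, toy_lexicon)
print(f"Lexical coverage: {coverage:.1f}%")
```

In this toy example four of the five base forms are in the lexicon, giving 80% coverage. Ambiguity, compounds and multiword expressions, which the abstract lists as open issues, are deliberately not handled.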

Crossref

Helsingin yliopiston digitaalinen arkisto

Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals – Collected Notes on Quality Improvement

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Publication venue: CEUR-WS.org
Publication date: 06/03/2019

Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

Author: Kervinen Jukka
Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Publication date: 03/04/2018

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi. The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at the NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to current OCR using the ground truth data. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Ruokolainen Teemu Petteri
Publication date: 03/04/2018

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Pääkkönen Tuula Anneli
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/05/2017

In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images with different image preprocessing methods. Against the commercial ABBYY FineReader toolkit, our method achieves a 27.48% (FineReader 7 or 8) and a 9.16% (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Author: Kettunen Kimmo Tapio
Kuokkala Juha Markus
Mäkelä Eetu
Niemi Jyrki Antero
Ruokolainen Teemu Petteri
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2016

Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system’s performance is genre and domain dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report first trials and evaluation of NER with data from the digitized Finnish historical newspaper collection Digi. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word-level correctness is about 74–75% [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER is able to achieve an F-score of up to 60.0 with named entities in the evaluation data. SeCo’s tools achieve F-scores of 30.0–60.0 with locations and persons. The performance of FiNER and SeCo’s tools with the data shows that at best about half of the named entities can be recognized even in quite erroneous OCRed text. Peer reviewed.

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

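The National Library of Finland's historical newspapers and journals as open digital data: data packages, interfaces, users and research problems (original Finnish title: Kansalliskirjaston historialliset sanoma- ja aikakauslehdet avoimena digitaalisena datana - datapaketteja, rajapintoja, käyttäjiä ja tutkimusongelmia)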

Author: Kettunen Kimmo Tapio
Pääkkönen Tuula Anneli
Publication date: 01/01/2018

This article surveys the research use of the open data from the National Library of Finland's digitized newspaper and journal materials. A data package covering the years 1771–1910 was published from these materials in 2017, and slightly more than a year of experience with its research use has accumulated so far. The survey also touches on the online use of the material in research. In addition, we present the software interfaces through which the materials can be accessed. Peer reviewed.

Directory of Open Access Journals

Journal.fi

Helsingin yliopiston digitaalinen arkisto

National Library of Finland DSpace Services